The research objectives of this project were to create new mathematical tools for
understanding different kinds of information networks, especially their dynamics,
and to import tools from geometry to analyze network dynamics. In particular, we
aimed to create new mathematical frameworks for visualizing and teasing apart
multiscale network dynamics. We see this as extremely relevant for the analysis of
large document corpora. The primary technical approach exploits ideas from linear
algebra, Markov processes, diffusion networks, differential geometry, and machine
learning.

Project  1:  the  deployment  of  the  partition  decoupling  method  [PDM]1  in  novel 
contexts  to  understand  hierarchical  structures  in  various  networks. 

The PDM is an analytic tool, roughly a form of unsupervised learning, that enables
the articulation of multiscale structure in networks. In project-related work we
showed that the PDM could be used to discover novel hidden structures in a variety
of network contexts: voting networks, personality data, and cell signaling data. This
work resulted in three publications in diverse journals.

A. PDM and the SCN (suprachiasmatic nucleus):

• S. Pauls, Foley NC, Foley DK, LeSauter J, Hastings MH, Maywood ES, Silver R,
Differential contributions of intra- and inter-cellular mechanisms to spatial and
temporal architecture of the suprachiasmatic nucleus circadian circuitry in
wild-type, CRY- and VPAC2-null mutant mice. Eur. J. Neuroscience 40:3 (2014),
2528-2540.

The SCN produces various signals that generate the circadian rhythm, which can be
observed by measuring concentrations of various proteins, gene expression levels,
etc. In this way the SCN is a living network of information transfer, but one that is
not well understood. Particular questions of interest for this study are the
mechanisms by which the SCN 1) maintains a coherent rhythm over time and 2) adjusts
the rhythm to reflect external stimulus (e.g., overcoming jet lag, or the more gradual
adjustment to seasonal changes).

1 G. Leibon, S. Pauls, D. Rockmore, and R. Savell, Topological structures in the equities market
network, PNAS 2008 105 (52) 20589-20594; published ahead of print December 22, 2008,
doi:10.1073/pnas.0802806106

Using the PDM we tested the hypothesis that different areas of the SCN tissue
activate in a sequence guided by both chemical and spatial factors. This sequential
activation is what yields the robust circadian rhythm and also provides a substrate
on which changes can be made. The PDM isolated the areas of the tissue that form
the sequence of activation, a task for which spectral clustering is well suited. We
are able to detect the coherent sets of tissue under the hypotheses that 1) the
tissue activates sequentially and 2) there exists an (unobserved) communication
network among the cells that facilitates the regulation of the signal. This work
suggested many new lines of current research.
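
The role of spectral clustering here can be illustrated with a minimal sketch. It is not the PDM itself or the pipeline used in the paper: the synthetic "cells", the correlation-based similarity, and the function name spectral_bipartition are all assumptions made for illustration. The sketch splits a set of oscillating time series into two groups using the sign of the Fiedler vector of the graph Laplacian.

```python
import numpy as np

def spectral_bipartition(series):
    """Split time series (one per row) into two groups via the Fiedler vector."""
    # Similarity: correlations mapped to [0, 1]. This choice is an assumption.
    corr = np.corrcoef(series)
    weights = (corr + 1.0) / 2.0
    np.fill_diagonal(weights, 0.0)
    # Unnormalized graph Laplacian L = D - W.
    laplacian = np.diag(weights.sum(axis=1)) - weights
    # eigh returns eigenvalues in ascending order; column 1 is the Fiedler vector.
    _, vectors = np.linalg.eigh(laplacian)
    fiedler = vectors[:, 1]
    return (fiedler > 0).astype(int)

t = np.linspace(0.0, 4.0 * np.pi, 200)
# Synthetic "cells": three oscillate nearly in phase, three nearly in antiphase.
phases = (0.0, 0.1, 0.2, np.pi, np.pi + 0.1, np.pi + 0.2)
cells = np.array([np.sin(t + p) for p in phases])
labels = spectral_bipartition(cells)
```

With more than two activation groups one would typically keep several Laplacian eigenvectors and cluster the rows of the resulting embedding; the two-way split is only the simplest case.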

Figure 1. A still from a movie of the SCN in wild-type (WT) mice. Individual cells show sinusoidal
oscillatory behavior, while the clusters of tissue that activate sequentially are spatially contiguous
and activate regularly in order over several days of recording.


B.  PDM  and  Voting  networks: 

• S. Pauls, G. Leibon, and D. Rockmore, The social identity voting model: Ideology
and community structures, Research & Politics 2015 2 (2) 2053168015570415.

Voting "networks", which is to say the networks created by the correlation
structure in voting records, can be used to uncover the various methods of
influence in the information structure of a legislative body, and with that the
communities of influence and ideology in a society. We have a new model of voting
behavior based on "social identity theory," a social sciences framework that says
that identity is a construct partially created and reinforced by the various circles
within which one moves and the degree to which one belongs to them. In the context
of voting, we take the body of legislator voting records (the "roll call" votes) in a
given Congress to learn the social circles and then reconstruct a voting record as
essentially a weighted combination of "ideal voters" per social circle. The standard
model, by Poole and Rosenthal, basically assumes that there is only one factor
that matters: where a legislator lives on the liberal-conservative axis. We find that
our model is a much more accurate representation of the voting behavior (as
represented by the record). It also gives us the ability to drill down on the record
and discover more interesting influences working within the parties. Our best
example is a deeper look at the Tea Party in the 112th Congress, whose members are
best distinguished by differing opinions on foreign policy and defense
appropriations.
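
The idea of reconstructing a record as a weighted combination of "ideal voters" can be sketched as a least-squares fit. This is an illustration under stated assumptions, not the model from the paper: the roll-call matrix is synthetic, the circle labels are taken as given, each ideal voter is assumed to be the mean record of its circle, and the names ideal_voters and circle_weights are hypothetical.

```python
import numpy as np

def ideal_voters(votes, circles):
    """Mean vote vector of each social circle (rows: circles, columns: roll calls)."""
    return np.array([votes[circles == c].mean(axis=0) for c in np.unique(circles)])

def circle_weights(record, ideals):
    """Least-squares weights w such that record is approximately w @ ideals."""
    weights, *_ = np.linalg.lstsq(ideals.T, record, rcond=None)
    return weights

# Synthetic roll-call matrix: +1 = yea, -1 = nay (illustration only).
votes = np.array([
    [ 1,  1,  1, -1, -1, -1],
    [ 1,  1,  1, -1, -1,  1],
    [-1, -1,  1,  1,  1,  1],
    [-1, -1,  1,  1,  1, -1],
], dtype=float)
circles = np.array([0, 0, 1, 1])  # hypothetical circle labels, e.g. from clustering

ideals = ideal_voters(votes, circles)
weights = circle_weights(votes[0], ideals)
reconstruction = weights @ ideals
```

In this toy case the first legislator's record is explained entirely by the first circle, so the fitted weights are approximately (1, 0). A one-dimensional spatial model would instead summarize each legislator by a single coordinate.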

C.  PDM  and  Personality 

• T. Bates, S. Brocklebank, S. Pauls, and D. Rockmore, A spectral clustering
approach to the structure of personality: contrasting the FFM and HEXACO
models, Journal of Research in Personality, Volume 57, August 2015, Pages
100-109.

Two basic questions in personality research concern the dimensionality of
personality (which is widely believed to be either 5 or 6) and the content of those
dimensions. Debate over these questions has largely hinged on results from factor
analysis of questionnaire data. Here we use the methodology of spectral clustering,
a key piece of the PDM, to test the structure of personality and compare the results
with those from factor analysis. Our studies give unambiguous support for a
six-domain solution using spectral clustering. Spectral clustering provides a valuable
function in situations where few if any items have strong loadings on a domain. In
addition to support for a sixth domain of "Honesty-Humility", the results also refocus
the conventional five domains in important ways, which are discussed.

Project 2: New paradigms for information networks.

In this work we bring techniques from geometry to provide alternative ways of
understanding information networks. We are interested in the evolution as well as
the navigation (search) of such networks, especially in the case of networks built on
text, which is often the basic material on which information networks are built. In
the case of text, the scenario is that there is a document corpus wherein the
documents are conceptually linked (e.g., research papers in a given research area,
legal documents). The big questions that are then studied are: (1) can we find ways
to trace the flow of ideas in the documents (articulate the inheritance and
transmission of ideas); (2) can we measure the effects of inserting new documents
into the corpus; and (3) can we find ways to better articulate the influence of
documents or ideas, and even predict what kinds of documents or ideas have a good
chance of being influential?


A.  Networks  with  Direction 

• "Orienteering in Knowledge Spaces: The Hyperbolic Geometry of Wikipedia
Mathematics," G. Leibon and D. Rockmore, PLoS ONE, 2013.

In this work we introduce the notion of a "network with directions". This is a
network with a distinguished set of nodes ("directions") along with a new metric (not
the usual path length metric) that is based on the idea of a "four-point probe"
(inspired by a methodology in materials science) and builds on the well-known
connections between network structure and electrical network theory.2 The point of
this work is to create a metric that better reflects the notion of "exploration", one
that accounts for nodes having different global properties (as perhaps encoded in
metadata), so that, starting at a particular node in the network, it makes sense to
search in a direction of inquiry rather than by simple nearest-neighbor path-length
exploration. The four-point probe-based metric gives a hyperbolic structure to the
associated geometry and is of interest in its own right.
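
The connection between network structure and electrical network theory can be illustrated with effective resistance, the standard quantity obtained by treating each edge as a resistor. The sketch below is not the four-point probe metric of the paper, whose definition is not reproduced here; it only shows the classical electrical distance that such constructions build on. The function name effective_resistance is hypothetical.

```python
import numpy as np

def effective_resistance(adjacency):
    """All-pairs effective resistance of a connected, undirected weighted graph."""
    adjacency = np.asarray(adjacency, dtype=float)
    # Graph Laplacian L = D - W, with edge weights read as conductances.
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    pinv = np.linalg.pinv(laplacian)
    diag = np.diag(pinv)
    # R_ij = L+_ii + L+_jj - 2 L+_ij, where L+ is the Moore-Penrose pseudoinverse.
    return diag[:, None] + diag[None, :] - 2.0 * pinv

# Path graph 0 - 1 - 2 with unit conductances: two unit resistors in series.
path = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])
resistance = effective_resistance(path)
```

Unlike link distance, effective resistance decreases when many independent paths join two nodes, which is one reason electrical quantities are a natural starting point for metrics meant to reflect exploration.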

Our basic example in this paper is the MathWiki, the pages of Wikipedia devoted to
mathematics. In this case the "directions" are the "list_of" pages, those pages that
contain all the links for math pages associated with a given topic (e.g., geometry, set
theory, etc.). While the list_of pages make it possible to find extremely short
link-paths from one page to another, and hence from one concept to another, the
link-distance does not at all reflect the conceptual distance between topics. We
believe that the distance based on the four-point probe, along with the conceptual
directions defined by the list_of pages, which act as "points at infinity" in the
associated hyperbolic metric, does a much better job of articulating conceptual
distance. This is further encoded in the geodesic bundles produced under this
metric, which produce conceptual paths between pages. Our next goal is to
investigate a deployment of this geometry on a space of several hundred thousand
judicial decisions, in order to understand the geometry of ideas in the law.


2 See, e.g., P. G. Doyle and J. L. Snell, "Random Walks and Electrical Networks," Carus Mathematical
Monographs, Vol. 22, MAA, 1984.

Figure 2. On the left is a cartoon of what the space of WikiMath looks like under the four-point probe
geometry. Notice the curved hyperbolic triangle defined by the three (finite) points (concepts, pages)
in the space. On the right is the MDS placement of the space with the pages indicated. Notice how the
(thin) hyperbolic triangle is preserved, as well as the many points between them, representing
"intermediate" concepts.


C.  The  Evolution  of  National  Constitutions 

• The Cultural Evolution of National Constitutions, D. Rockmore, D. Krakauer,
T. Ginsburg, N. Foti, and C. Fang, under review (at PNAS).

In this work we introduce a general methodology, a hybrid of approaches inspired by
biology and genetics, to analyze general patterns of cultural inheritance and
innovation, in the context of a sample of 99 of the approximately 600
English-language constitutional texts (translations where necessary), spanning
1787-2008, available as part of the Comparative Constitutions Project.3 In this
setting it takes the form of a study of the diffusion of ideas as represented in this
time-stamped text corpus. We use the basic information derived from a topic modeling
of the corpus to construct cultural diffusion trees (a specific form of diffusion
networks) to characterize constitutions as cultural recombinants borrowing from
ancestral constitutions back to the Last Universal Common Ancestor of Constitutions
(LUCAC), the US Constitution of 1787. Among the discoveries we make is that
constitutions cluster into three epochs within which concepts are frequently shared.
Natural metrics from the diffusion network setting give a basic taxonomy of
constitutions reflecting the degree to which they borrow and transmit concepts. This
framework supports the notion that culture does in fact support a genetic and
particulate structure, but one with significant variation in the basic patterns of
descent. The methodology is quite general and could be applied to other kinds of
text corpora or cultural or media artifacts.
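
How a diffusion tree might be built from time-stamped topic distributions can be sketched as follows. This is a deliberately simplified illustration, not the construction used in the paper: it assumes each document inherits from a single parent, chosen as the earlier document with the closest topic distribution under the Hellinger distance. The toy data and the function name diffusion_tree are hypothetical.

```python
import numpy as np

def diffusion_tree(years, topics):
    """Map each document index to its parent index (None for the earliest documents)."""
    years = np.asarray(years)
    topics = np.asarray(topics, dtype=float)
    parents = {}
    for i in range(len(years)):
        earlier = np.flatnonzero(years < years[i])
        if earlier.size == 0:
            parents[i] = None  # no predecessor: a root of the tree
            continue
        # Hellinger distance between topic distributions (an assumed similarity measure).
        diff = np.sqrt(topics[earlier]) - np.sqrt(topics[i])
        dists = np.sqrt(0.5 * (diff ** 2).sum(axis=1))
        parents[i] = int(earlier[np.argmin(dists)])
    return parents

# Toy corpus: four documents with three-topic distributions (synthetic values).
years = [1787, 1850, 1900, 1950]
topics = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
]
tree = diffusion_tree(years, topics)
```

A recombinant picture, in which a document borrows from several ancestors, would replace the single argmin with a weighted set of parents; the single-parent rule is used here only to keep the sketch short.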


3  comparativeconstitutionsproject.org 


Figure 3. The diffusion tree on a sample of 99 constitutions over the years 1787-2008, built from an
LDA topic modeling of the corpus.

Figure 4. Clustering of the constitutional data set. The clusters are temporally localized with eras
1787-1970 (upper right), 1971-1990 (middle), and 1991-2008 (upper left). Important "bridging"
constitutions are Spain (1931) and Haiti (1987), acknowledged constitutional innovators.


D.  Knowledge  Networks  as  Landscapes. 


•  Bending  the  Law,  G.  Leibon,  M.  Livermore,  A.  Riddell,  and  D.  Rockmore, 
preprint  (2015).  [Aiming  for  journal  submission  in  fall  2015.] 


In this work we once again consider the context of an evolving knowledge corpus. Of
interest here is how certain ideas gain traction (have influence) while others fail to
gain traction or fall out of favor. We see this as an interaction between the action of
search in the corpus and the basic connectivity and similarity between the entities
in the corpus. The particular context of interest is once again a corpus of
documents, in this case judicial opinions, which reference each other and also have
textual similarity quantified through topic similarity. Standard search mechanisms
for this corpus are driven by the basic connectivity of the citation network. We
extend the associated random walk to include textual similarity as well. Given the
random walk, we define a notion of curvature for the network (space) that depends
on a new notion of distance in the space (determined through the use of the random
walk). A point of "high curvature" is difficult to escape from. This in turn allows us to
define a notion of "bending", a measure of the change in curvature over time at a
point. In short, what we discover is that points and regions where the curvature
increases (becomes more positive) tend to be areas of "puddling" in the sense that,
in the future, they become less influential, whereas areas where the curvature
decreases (so becomes increasingly negative and more saddle point-like) tend to be
more influential in the future. These are regions of "drainage" in the sense that ideas
move through these points and regions. We test this on a corpus of Supreme Court
opinions over the years 1951-2007 and find that the notions of puddling and
drainage do in fact achieve the stated effects, with opinions being either 5% more
or 5% less likely to have future impact (in a sense we make rigorous) depending on
these characterizations.
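
Extending a citation-driven random walk to include textual similarity can be sketched as a convex combination of two transition matrices. This is a minimal illustration, not the walk defined in the preprint: the mixing parameter alpha, the self-loop treatment of opinions that cite nothing, and the names row_normalize and mixed_walk are assumptions. The curvature and bending computations are not reproduced here.

```python
import numpy as np

def row_normalize(matrix):
    """Turn a nonnegative matrix into a row-stochastic transition matrix."""
    matrix = np.asarray(matrix, dtype=float)
    totals = matrix.sum(axis=1, keepdims=True)
    # Rows with no outgoing weight become self-loops so every row sums to 1.
    fixed = np.where(totals > 0, matrix, np.eye(len(matrix)))
    return fixed / fixed.sum(axis=1, keepdims=True)

def mixed_walk(citations, similarity, alpha=0.5):
    """Step along a citation with probability alpha, else along textual similarity."""
    return alpha * row_normalize(citations) + (1.0 - alpha) * row_normalize(similarity)

# Toy corpus of three opinions: opinion 1 cites 0, and opinion 2 cites 0 and 1.
citations = np.array([[0, 0, 0],
                      [1, 0, 0],
                      [1, 1, 0]])
# Symmetric topic-similarity scores (synthetic values).
similarity = np.array([[0.0, 0.2, 0.8],
                       [0.2, 0.0, 0.4],
                       [0.8, 0.4, 0.0]])
transition = mixed_walk(citations, similarity, alpha=0.5)
```

Because both components are row-stochastic, so is their mixture, and quantities derived from the walk (for example, distances based on hitting or commute times) remain well defined.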

Figure 5. On the left is a histogram of the curvatures for the corpus of opinions as of 1990.
Note that they are all negative, reflecting a generally locally hyperbolic structure. On the right is a
distribution of the bending values, computed relative to initial time points of 1995 and 1990, i.e., this
represents the way in which the curvature changes over the network in time. Note the long positive
tail reflecting places where the topics become vestigial, and the shorter left "foot" of opinions around
topics that are becoming more interesting in time and where one can expect a good deal of future
work.

E.  Link  prediction  in  networks 

• "Multi-Task Metric Learning on Network Data", C. Fang and D. Rockmore,
Advances in Knowledge Discovery and Data Mining, Lecture Notes in
Computer Science Volume 9077, 2015, pp 317-329; Accepted to Pacific-Asia
Conference on Knowledge Discovery and Data Mining (PAKDD) 2015;
http://arxiv.org/abs/1411.2337

•  "Sparse  Coding  for  Key  Node  Selection  over  Networks,"  Y.  Xu  and  D. 
Rockmore,  Discovery  Science.  Lecture  Notes  in  Computer  Science  Volume 
8777,  (2014),  pp  337-349 

In  this  work  we  consider  the  problem  of  link  prediction  in  networks.  In  the  first 
paper  cited  here  we  address  link  prediction  via  the  framework  of  multi-task  learning 
(MTL),  a  technique  that  has  been  shown  to  improve  prediction  performance  in  a 
number  of  different  contexts  by  learning  models  jointly  on  multiple  different,  but 
related tasks. The proposed approach builds on structural metric learning and
intermediate parameterization, and admits an efficient implementation via stochastic
gradient descent. We consider two common real-world applications: citation
prediction  for  Wikipedia  articles  and  social  circle  prediction  in  Google+.  The 
proposed  method  achieves  promising  results  and  exhibits  good  convergence 
behavior. 

In the second paper we take on the issue that the size of networks now needed to
model real-world phenomena poses significant computational challenges. We
introduce the notion of key node selection in networks (KNSIN), a technique for
discovering a (much smaller) representative set of nodes able to preserve the sketch
of the network. KNSIN is accomplished via a sparse coding algorithm that efficiently
learns a basis set over the feature space defined by the nodes. By applying a stopping
criterion, KNSIN automatically learns the dimensionality of the node space and
guarantees that the learned basis accurately preserves the sketch of the original
node space. We demonstrate the effectiveness of the KNSIN algorithm in experiments
on two large-scale network datasets.

F.  Organizational  networks 

• R. L. Lumsdaine, D. N. Rockmore, N. Foti, G. Leibon, and J. D. Farmer, The
Intrafirm Complexity of Systemically Important Financial Institutions (May 8,
2015). Presented at 2015 SYRTO Conference on Systemic Risk. Available at
SSRN: http://ssrn.com/abstract=2604166 or
http://dx.doi.org/10.2139/ssrn.2604166 (preparing for a fall 2015 journal
submission).

Large financial organizations are both part of complex networks (e.g., the global
financial system) and networks themselves. In November 2011, the
Financial  Stability  Board,  in  collaboration  with  the  International  Monetary  Fund, 
published  a  list  of  29  "systemically  important  financial  institutions"  (SIFIs).  This 
designation  reflects  a  concern  that  the  failure  of  any  one  of  them  could  have 
dramatic  negative  consequences  for  the  global  economy  and  is  based  on  "their  size, 
complexity,  and  systemic  interconnectedness".  While  the  characteristics  of  "size" 
and  "systemic  interconnectedness"  have  been  the  subject  of  a  good  deal  of 
quantitative  analysis,  less  attention  has  been  paid  to  measures  of  a  firm’s 
"complexity."  In  this  paper  we  take  on  the  challenges  of  measuring  the  complexity  of 
a  financial  institution  by  exploring  the  use  of  the  structure  of  an  individual  firm’s 
control  hierarchy  as  a  proxy  for  institutional  complexity.  The  control  hierarchy  is  a 
network  representation  of  the  institution  and  its  subsidiaries.  We  show  that  this 
mathematical  representation  (and  various  associated  metrics)  provides  a  consistent 
way  to  compare  the  complexity  of  firms  with  often  very  disparate  business  models 
and  as  such  may  provide  the  foundation  for  determining  a  SIFI  designation.  By 
quantifying  the  level  of  complexity  of  a  firm,  our  approach  also  may  prove  useful 
should  firms  need  to  reduce  their  level  of  complexity  either  in  response  to  business 
or  regulatory  needs.  Using  a  data  set  containing  the  control  hierarchies  of  many  of 
the  designated  SIFIs,  we  find  that  between  2011  and  2013,  these  firms  have 
decreased  their  level  of  complexity,  perhaps  in  response  to  regulatory 
requirements. 
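
Treating a control hierarchy as a tree makes simple structural statistics easy to compute. The sketch below is an illustration only: the specific statistics (entity count, maximum depth, leaf count), the child-to-parent input format, the entity names, and the function name hierarchy_metrics are assumptions, not the complexity measures developed in the paper.

```python
from collections import defaultdict

def hierarchy_metrics(parent_of):
    """Basic size and shape statistics of a control hierarchy given as child -> parent."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)
    # Roots are entities that control others but are not themselves controlled.
    roots = [node for node in children if node not in parent_of]
    depth = {root: 0 for root in roots}
    stack = list(roots)
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            depth[child] = depth[node] + 1
            stack.append(child)
    leaves = [node for node in depth if node not in children]
    return {
        "entities": len(depth),
        "max_depth": max(depth.values()),
        "leaves": len(leaves),
    }

# Hypothetical firm: a holding company with a four-entity subsidiary structure.
firm = {
    "Bank A": "HoldCo",
    "Broker B": "HoldCo",
    "Fund C": "Bank A",
    "SPV D": "Fund C",
}
metrics = hierarchy_metrics(firm)
```

Computing the same statistics for snapshots of a firm in different years gives one simple way to track whether its hierarchy has become shallower or smaller over time.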


Figure 6. The tree for one of the Systemically Important Financial Institutions and its hierarchy of
subsidiaries (as determined by its control hierarchy), color-coded according to the three-digit
Standard Industrial Classification (SIC) code of the entities.